Memory-based morphological analysis and part-of-speech tagging of Arabic
Abstract
Memory-based learning has been successfully applied to morphological analysis and part-of-speech tagging in Western and Eastern European languages (Daelemans et al., 1996; Van den Bosch and Daelemans, 1999; Zavrel and Daelemans, 1999). With the release of the Arabic Treebank by the Linguistic Data Consortium, a large corpus has become available for Arabic that can serve as training material for machine-learning algorithms. The data make it possible to train part-of-speech taggers, tokenizers, and shallow parsing units such as chunkers (Diab, Hacioglu, and Jurafsky, 2004); cf. Chapter 9. However, as argued and illustrated throughout this book, Arabic poses special challenges for data-driven and knowledge-based approaches alike. An Arabic word may be composed of a stem, itself consisting of a consonantal root and a pattern, and may furthermore contain affixes and clitics. Arabic verbs, for instance, are conjugated according to one of the traditionally recognized patterns. There are 15 triliteral forms, of which at least 9 are common, and the differences between them can be very subtle. Within each conjugation pattern an entire paradigm is found: two tenses (perfect and imperfect), two voices (active and passive), and five moods (indicative, subjunctive, jussive, imperative, and energetic). Arabic nouns show a comparably rich and complex morphological structure. In this chapter we explore the use of memory-based learning for morphological analysis and part-of-speech (POS) tagging of written Arabic. The next section summarizes the principles of memory-based learning. Section 3 describes the data used throughout the study for both tasks. The subsequent sections describe our work on memory-based morphological analysis (Section 4) and its integration with part-of-speech tagging (Section 5). The final Section 6 contains a short discussion of related work and offers an overall conclusion.
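As a rough illustration of the memory-based idea mentioned above, the following Python sketch stores labelled training instances verbatim and classifies a new instance by majority vote among its nearest stored neighbours under a simple overlap (mismatch-count) distance. It is a minimal sketch under stated assumptions, not the chapter's implementation: the class and function names, the window width, the "B"/"I" segment labels, and the transliterated toy words are all invented for illustration and are not drawn from the Arabic Treebank.

```python
from collections import Counter


def windows(word, width=2, pad="_"):
    """Return one fixed-width letter window per letter of `word`.

    Each window holds the focus letter plus `width` letters of context
    on either side; positions outside the word are filled with `pad`.
    """
    padded = pad * width + word + pad * width
    return [tuple(padded[i:i + 2 * width + 1]) for i in range(len(word))]


def overlap_distance(a, b):
    """Count the feature positions at which two instances differ."""
    return sum(x != y for x, y in zip(a, b))


class MemoryBasedClassifier:
    """Toy k-nearest-neighbour classifier (hypothetical helper class)."""

    def __init__(self, k=1):
        self.k = k
        self.memory = []

    def fit(self, instances, labels):
        # "Training" is simply storing every instance with its label.
        self.memory = list(zip(instances, labels))
        return self

    def predict(self, instance):
        # Rank stored instances by distance and vote among the k nearest.
        ranked = sorted(self.memory,
                        key=lambda item: overlap_distance(item[0], instance))
        votes = Counter(label for _, label in ranked[:self.k])
        return votes.most_common(1)[0][0]


def label_word(clf, word):
    """Predict one label per letter and join the labels into a string."""
    return "".join(clf.predict(w) for w in windows(word))


# Invented toy data: "B" marks a letter that begins a segment,
# "I" a letter inside a segment. Labels are hand-assigned assumptions.
toy_data = [("wktb", "BBII"), ("wdrs", "BBII"), ("ktb", "BII"), ("drs", "BII")]

instances, labels = [], []
for word, tags in toy_data:
    instances.extend(windows(word))
    labels.extend(tags)

clf = MemoryBasedClassifier(k=1).fit(instances, labels)
print(label_word(clf, "wktb"))
```

The sketch weights every feature position equally; feature weighting and larger values of `k` are natural refinements that the toy example leaves out.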
Similar resources
Memory-Based Morphological Analysis Generation and Part-of-Speech Tagging of Arabic
We explore the application of memory-based learning to morphological analysis and part-of-speech tagging of written Arabic, based on data from the Arabic Treebank. Morphological analysis (the construction of all possible analyses of isolated unvoweled wordforms) is performed as a letter-by-letter operation prediction task, where the operation encodes segmentation, part-of-speech, character cha...
A part-of-speech tagging system for the Persian language
Abstract: Part-of-speech (POS) tagging is an essential step for many models and methods in other areas of natural language processing, such as machine translation, spell checking, text-to-speech, and automatic speech recognition. So far, highly accurate POS taggers have been created for many languages. In this paper, we focus on POS tagging in the Persian language. Because of problems in Persian POS t...
ACL - 05 Computational Approaches to Semitic Languages
We explore the application of memory-based learning to morphological analysis and part-of-speech tagging of written Arabic, based on data from the Arabic Treebank. Morphological analysis (the construction of all possible analyses of isolated unvoweled wordforms) is performed as a letter-by-letter operation prediction task, where the operation encodes segmentation, part-of-speech, character cha...
Joint Arabic Segmentation and Part-Of-Speech Tagging
Arabic has a very complex morphological system, though a very structured one. Character patterns are often indicative of word class and word segmentation. In this paper, we explore a novel approach to Arabic word segmentation and part-of-speech tagging relying on character information. The approach is lexicon-free and does not require any morphological analysis, eliminating the factor of dict...
Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop
We present an approach to using a morphological analyzer for tokenizing and morphologically tagging (including part-of-speech tagging) Arabic words in one process. We learn classifiers for individual morphological features, as well as ways of using these classifiers to choose among entries from the output of the analyzer. We obtain accuracy rates on all tasks in the